“Given that it costs 10 times more to acquire a new customer than to retain an existing one, customer retention has now become even more important than customer acquisition. For many providers, retaining highly profitable customers is the number one business goal. To reduce customer churn, telecommunications companies need to predict which customers are at high risk of churn.” (D207 D208 D209 Churn Data Consideration and Dictionary.pdf) Beyond prediction, the company may also want to identify the main factors that influence the probability of churn, positively or negatively, so that it can act on them. Analyzing the data gives the company and its stakeholders a clear picture of these factors and of how reliable each one is as a predictor.
The question to be answered is which variables have a relationship with customer churn, and how those relationships can be used to predict each customer's churn probability.
Logistic regression will be used to identify the factors and features that affect the customers' churn probability and to build a prediction model based on those features.
The churn probability of each customer may depend on several factors, such as customer satisfaction, service quality, and price, and some of these factors may be more important and more statistically significant than others. The objective of the data analysis is to identify these features and test their significance in order to build a logistic regression model that predicts a customer's churn probability from the available features. This gives stakeholders the insight needed to counter the factors that increase churn and reinforce those that decrease it.
APPROPRIATE OUTCOME STRUCTURE Binary logistic regression requires the dependent variable to be binary and ordinal logistic regression requires the dependent variable to be ordinal.
OBSERVATION INDEPENDENCE Logistic regression requires the observations to be independent of each other. In other words, the observations should not come from repeated measurements or matched data.
ABSENCE OF MULTICOLLINEARITY Logistic regression requires there to be little or no multicollinearity among the independent variables. This means that the independent variables should not be too highly correlated with each other.
LINEARITY OF INDEPENDENT VARIABLES AND LOG ODDS Logistic regression assumes linearity of independent variables and log odds. Although this analysis does not require the dependent and independent variables to be related linearly, it requires that the independent variables are linearly related to the log odds.
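One informal way to eyeball this assumption is to bin a continuous predictor and compute the empirical log odds per bin. A rough sketch on synthetic data (x and y are placeholders, not churn fields):

```python
# Sketch: checking linearity in the log odds by computing empirical log odds
# per decile of a continuous predictor (synthetic x and y, not churn fields).
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 5000)
p = 1 / (1 + np.exp(-(-2 + 0.4 * x)))   # outcome generated linear in the log odds
y = rng.binomial(1, p)

df = pd.DataFrame({'x': x, 'y': y})
df['bin'] = pd.qcut(df['x'], 10)                      # deciles of the predictor
rates = df.groupby('bin', observed=True)['y'].mean()  # churn-style rate per bin
log_odds = np.log(rates / (1 - rates))                # should rise roughly linearly
print(log_odds)
```

Plotting these log odds against the bin midpoints should give an approximately straight line when the assumption holds.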
LARGE SAMPLE SIZE Logistic regression typically requires a large sample size. A general guideline is that you need a minimum of 10 cases of the least frequent outcome for each independent variable in your model. For example, if you have 5 independent variables and the expected probability of your least frequent outcome is .10, then you would need a minimum sample size of 500 (10*5 / .10).
https://www.statology.org/assumptions-of-logistic-regression/
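The sample-size guideline above reduces to simple arithmetic; a minimal illustrative helper:

```python
# Sketch: the 10-events-per-variable guideline quoted above, as arithmetic.
def min_sample_size(n_predictors, p_least_frequent, events_per_var=10):
    """Minimum N so the least frequent outcome yields ~10 cases per predictor."""
    return events_per_var * n_predictors / p_least_frequent

print(min_sample_size(5, 0.10))  # → 500.0, matching the worked example above
```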
Python was selected: a general-purpose, interpreted, object-oriented language that supports many useful packages for building linear models. The following Python libraries were selected:
Churn is a binary categorical variable (Yes/No). Its probability may depend on several factors, such as customer satisfaction, service quality, and price, and some of these factors may be more important and more statistically significant than others.
Logistic regression is an appropriate technique for identifying the factors and features that affect the customers' churn probability, and the significance of each, because it models the relationship between multiple explanatory variables and a single binary categorical dependent variable ('Churn'), producing a classifier that estimates the probability that churn occurs.
#importing the needed libraries.
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None) #modified Pandas options to fully display the large dataset when needed
## !!! Displaying a full DataFrame of this size may slow down or overload the notebook output.
import numpy as np
np.set_printoptions(precision=5, suppress=True)
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import matplotlib as mpl
import sklearn
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
#Reading the Data file (As received) churn_clean.csv
df_raw= pd.read_csv('churn_clean.csv')
target='Churn'
df_raw.head()
|   | CaseOrder | Customer_id | Interaction | UID | City | State | County | Zip | Lat | Lng | Population | Area | TimeZone | Job | Children | Age | Income | Marital | Gender | Churn | Outage_sec_perweek | Email | Contacts | Yearly_equip_failure | Techie | Contract | Port_modem | Tablet | InternetService | Phone | Multiple | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | PaperlessBilling | PaymentMethod | Tenure | MonthlyCharge | Bandwidth_GB_Year | Item1 | Item2 | Item3 | Item4 | Item5 | Item6 | Item7 | Item8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | K409198 | aa90260b-4141-4a24-8e36-b04ce1f4f77b | e885b299883d4f9fb18e39c75155d990 | Point Baker | AK | Prince of Wales-Hyder | 99927 | 56.25100 | -133.37571 | 38 | Urban | America/Sitka | Environmental health practitioner | 0 | 68 | 28561.99 | Widowed | Male | No | 7.978323 | 10 | 0 | 1 | No | One year | Yes | Yes | Fiber Optic | Yes | No | Yes | Yes | No | No | No | Yes | Yes | Credit Card (automatic) | 6.795513 | 172.455519 | 904.536110 | 5 | 5 | 5 | 3 | 4 | 4 | 3 | 4 |
| 1 | 2 | S120509 | fb76459f-c047-4a9d-8af9-e0f7d4ac2524 | f2de8bef964785f41a2959829830fb8a | West Branch | MI | Ogemaw | 48661 | 44.32893 | -84.24080 | 10446 | Urban | America/Detroit | Programmer, multimedia | 1 | 27 | 21704.77 | Married | Female | Yes | 11.699080 | 12 | 0 | 1 | Yes | Month-to-month | No | Yes | Fiber Optic | Yes | Yes | Yes | No | No | No | Yes | Yes | Yes | Bank Transfer(automatic) | 1.156681 | 242.632554 | 800.982766 | 3 | 4 | 3 | 3 | 4 | 3 | 4 | 4 |
| 2 | 3 | K191035 | 344d114c-3736-4be5-98f7-c72c281e2d35 | f1784cfa9f6d92ae816197eb175d3c71 | Yamhill | OR | Yamhill | 97148 | 45.35589 | -123.24657 | 3735 | Urban | America/Los_Angeles | Chief Financial Officer | 4 | 50 | 9609.57 | Widowed | Female | No | 10.752800 | 9 | 0 | 1 | Yes | Two Year | Yes | No | DSL | Yes | Yes | No | No | No | No | No | Yes | Yes | Credit Card (automatic) | 15.754144 | 159.947583 | 2054.706961 | 4 | 4 | 2 | 4 | 4 | 3 | 3 | 3 |
| 3 | 4 | D90850 | abfa2b40-2d43-4994-b15a-989b8c79e311 | dc8a365077241bb5cd5ccd305136b05e | Del Mar | CA | San Diego | 92014 | 32.96687 | -117.24798 | 13863 | Suburban | America/Los_Angeles | Solicitor | 1 | 48 | 18925.23 | Married | Male | No | 14.913540 | 15 | 2 | 0 | Yes | Two Year | No | No | DSL | Yes | No | Yes | No | No | No | Yes | No | Yes | Mailed Check | 17.087227 | 119.956840 | 2164.579412 | 4 | 4 | 4 | 2 | 5 | 4 | 3 | 3 |
| 4 | 5 | K662701 | 68a861fd-0d20-4e51-a587-8a90407ee574 | aabb64a116e83fdc4befc1fbab1663f9 | Needville | TX | Fort Bend | 77461 | 29.38012 | -95.80673 | 11352 | Suburban | America/Chicago | Medical illustrator | 0 | 83 | 40074.19 | Separated | Male | Yes | 8.147417 | 16 | 2 | 1 | No | Month-to-month | Yes | No | Fiber Optic | No | No | No | No | No | Yes | Yes | No | No | Mailed Check | 1.670972 | 149.948316 | 271.493436 | 4 | 4 | 4 | 3 | 4 | 4 | 4 | 5 |
df_raw.info() #Data Information of the Raw DataFrame (As received)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 50 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   CaseOrder             10000 non-null  int64
 1   Customer_id           10000 non-null  object
 2   Interaction           10000 non-null  object
 3   UID                   10000 non-null  object
 4   City                  10000 non-null  object
 5   State                 10000 non-null  object
 6   County                10000 non-null  object
 7   Zip                   10000 non-null  int64
 8   Lat                   10000 non-null  float64
 9   Lng                   10000 non-null  float64
 10  Population            10000 non-null  int64
 11  Area                  10000 non-null  object
 12  TimeZone              10000 non-null  object
 13  Job                   10000 non-null  object
 14  Children              10000 non-null  int64
 15  Age                   10000 non-null  int64
 16  Income                10000 non-null  float64
 17  Marital               10000 non-null  object
 18  Gender                10000 non-null  object
 19  Churn                 10000 non-null  object
 20  Outage_sec_perweek    10000 non-null  float64
 21  Email                 10000 non-null  int64
 22  Contacts              10000 non-null  int64
 23  Yearly_equip_failure  10000 non-null  int64
 24  Techie                10000 non-null  object
 25  Contract              10000 non-null  object
 26  Port_modem            10000 non-null  object
 27  Tablet                10000 non-null  object
 28  InternetService       10000 non-null  object
 29  Phone                 10000 non-null  object
 30  Multiple              10000 non-null  object
 31  OnlineSecurity        10000 non-null  object
 32  OnlineBackup          10000 non-null  object
 33  DeviceProtection      10000 non-null  object
 34  TechSupport           10000 non-null  object
 35  StreamingTV           10000 non-null  object
 36  StreamingMovies       10000 non-null  object
 37  PaperlessBilling      10000 non-null  object
 38  PaymentMethod         10000 non-null  object
 39  Tenure                10000 non-null  float64
 40  MonthlyCharge         10000 non-null  float64
 41  Bandwidth_GB_Year     10000 non-null  float64
 42  Item1                 10000 non-null  int64
 43  Item2                 10000 non-null  int64
 44  Item3                 10000 non-null  int64
 45  Item4                 10000 non-null  int64
 46  Item5                 10000 non-null  int64
 47  Item6                 10000 non-null  int64
 48  Item7                 10000 non-null  int64
 49  Item8                 10000 non-null  int64
dtypes: float64(7), int64(16), object(27)
memory usage: 3.8+ MB
raw_variables=['CaseOrder', 'Customer_id', 'Interaction', 'UID', 'City', 'State',
'County', 'Zip', 'Lat', 'Lng', 'Population', 'Area', 'TimeZone', 'Job',
'Children', 'Age', 'Income', 'Marital', 'Gender', 'Churn',
'Outage_sec_perweek', 'Email', 'Contacts', 'Yearly_equip_failure',
'Techie', 'Contract', 'Port_modem', 'Tablet', 'InternetService',
'Phone', 'Multiple', 'OnlineSecurity', 'OnlineBackup',
'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
'PaperlessBilling', 'PaymentMethod', 'Tenure', 'MonthlyCharge',
'Bandwidth_GB_Year', 'Item1', 'Item2', 'Item3', 'Item4', 'Item5',
'Item6', 'Item7', 'Item8']
num_variables = ['CaseOrder', 'Zip', 'Lat', 'Lng', 'Population', 'Children',
'Age','Income', 'Outage_sec_perweek', 'Email', 'Contacts',
'Yearly_equip_failure', 'Tenure', 'MonthlyCharge', 'Bandwidth_GB_Year',
'Item1', 'Item2', 'Item3', 'Item4', 'Item5', 'Item6', 'Item7', 'Item8']
categ_variables = ['Customer_id', 'Interaction', 'UID', 'City', 'State', 'County', 'Area',
'TimeZone', 'Job', 'Marital', 'Gender', 'Churn', 'Techie', 'Contract',
'Port_modem', 'Tablet', 'InternetService', 'Phone', 'Multiple',
'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 'PaymentMethod']
print('raw_variables:',len(raw_variables),' num_variables:',len(num_variables),' cat_variables:',len(categ_variables))
raw_variables: 50 num_variables: 23 cat_variables: 27
#Summary statistics of Numerical variables of the Raw DataFrame
df_raw.describe().T #Transposed to show large number of variables displayed as rows
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CaseOrder | 10000.0 | 5000.500000 | 2886.895680 | 1.000000 | 2500.750000 | 5000.500000 | 7500.250000 | 10000.000000 |
| Zip | 10000.0 | 49153.319600 | 27532.196108 | 601.000000 | 26292.500000 | 48869.500000 | 71866.500000 | 99929.000000 |
| Lat | 10000.0 | 38.757567 | 5.437389 | 17.966120 | 35.341828 | 39.395800 | 42.106908 | 70.640660 |
| Lng | 10000.0 | -90.782536 | 15.156142 | -171.688150 | -97.082813 | -87.918800 | -80.088745 | -65.667850 |
| Population | 10000.0 | 9756.562400 | 14432.698671 | 0.000000 | 738.000000 | 2910.500000 | 13168.000000 | 111850.000000 |
| Children | 10000.0 | 2.087700 | 2.147200 | 0.000000 | 0.000000 | 1.000000 | 3.000000 | 10.000000 |
| Age | 10000.0 | 53.078400 | 20.698882 | 18.000000 | 35.000000 | 53.000000 | 71.000000 | 89.000000 |
| Income | 10000.0 | 39806.926771 | 28199.916702 | 348.670000 | 19224.717500 | 33170.605000 | 53246.170000 | 258900.700000 |
| Outage_sec_perweek | 10000.0 | 10.001848 | 2.976019 | 0.099747 | 8.018214 | 10.018560 | 11.969485 | 21.207230 |
| Email | 10000.0 | 12.016000 | 3.025898 | 1.000000 | 10.000000 | 12.000000 | 14.000000 | 23.000000 |
| Contacts | 10000.0 | 0.994200 | 0.988466 | 0.000000 | 0.000000 | 1.000000 | 2.000000 | 7.000000 |
| Yearly_equip_failure | 10000.0 | 0.398000 | 0.635953 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 6.000000 |
| Tenure | 10000.0 | 34.526188 | 26.443063 | 1.000259 | 7.917694 | 35.430507 | 61.479795 | 71.999280 |
| MonthlyCharge | 10000.0 | 172.624816 | 42.943094 | 79.978860 | 139.979239 | 167.484700 | 200.734725 | 290.160419 |
| Bandwidth_GB_Year | 10000.0 | 3392.341550 | 2185.294852 | 155.506715 | 1236.470827 | 3279.536903 | 5586.141369 | 7158.981530 |
| Item1 | 10000.0 | 3.490800 | 1.037797 | 1.000000 | 3.000000 | 3.000000 | 4.000000 | 7.000000 |
| Item2 | 10000.0 | 3.505100 | 1.034641 | 1.000000 | 3.000000 | 4.000000 | 4.000000 | 7.000000 |
| Item3 | 10000.0 | 3.487000 | 1.027977 | 1.000000 | 3.000000 | 3.000000 | 4.000000 | 8.000000 |
| Item4 | 10000.0 | 3.497500 | 1.025816 | 1.000000 | 3.000000 | 3.000000 | 4.000000 | 7.000000 |
| Item5 | 10000.0 | 3.492900 | 1.024819 | 1.000000 | 3.000000 | 3.000000 | 4.000000 | 7.000000 |
| Item6 | 10000.0 | 3.497300 | 1.033586 | 1.000000 | 3.000000 | 3.000000 | 4.000000 | 8.000000 |
| Item7 | 10000.0 | 3.509500 | 1.028502 | 1.000000 | 3.000000 | 4.000000 | 4.000000 | 7.000000 |
| Item8 | 10000.0 | 3.495600 | 1.028633 | 1.000000 | 3.000000 | 3.000000 | 4.000000 | 8.000000 |
*Function plt_summary() to inspect the data visually using histograms, box plots and scatter plots. The following function is defined to visually identify statistical parameters and to get a sense of the data, such as identifying outliers, ranges, dominant values, etc.
'''Custom function to inspect the data visually using Histogram/Bar, Boxplot and scatter plots.
This function takes 3 parameters: 1st is a DataFrame, 2nd is the preferred plot type ("hist", "box", "scat", "target_scat" or "target_histplot"), 3rd is the target column name.
The target argument is used only for the plots drawn against the target column and is ignored for the other options.
For histograms with non-numeric columns, the function will use bar charts (in green) instead of histograms.
"scat": scatter plot of each numeric variable against the 1st column of the DataFrame.
"target_scat": scatter plot of each numeric variable against the 'target' column of the DataFrame.
"target_histplot": histplot of numeric, or countplot of categorical, variables against the 'target' column.
'''
def plt_summary(df_1,plt_type,target):
numerics = ['uint8','uint16','uint32','int16', 'int32', 'int64', 'float16', 'float32', 'float64']
if plt_type =='target_histplot':
pltlist=df_1.columns
# for number,cols in enumerate(df_1.columns):
# if (len(df_1[cols].value_counts())>=20):
# pltlist.remove(cols)
print("Number of columns: "+str(len(pltlist))) #Printing the number of columns.
elif plt_type =='hist':
pltlist=df_1.columns
print("Number of columns: "+str(df_1.shape[1])) #Printing the number of columns.
else :
df_n = df_1.select_dtypes(include=numerics).copy() # selecting only the numeric columns as it's suitable for mentioned plots
pltlist=df_n.columns
print("Number of Numeric columns: "+str(df_n.shape[1])) #Printing the number of numeric columns.
plt.style.use('default')
plt.figure(figsize=(15,30))
plt_rows = 11
plt_cols = 5
for n, col in enumerate(pltlist): #enumerating each column of the data, starting from index
b=n+1
ax=plt.subplot2grid(shape=(plt_rows,plt_cols),loc=((int(np.ceil(b/plt_cols)-1)),((b-1)%plt_cols)))
if plt_type == 'hist':
if df_1[col].dtype in numerics:
ax.hist(df_1.loc[:,col],bins = 50) #plotting Histogram
ax.set_title(str(col)+' (*num)')
else:
if len(df_1[col].value_counts())>2000:
ax.set_title(str(col)+' ('+str(len(df_1[col].value_counts()))+' unq val.)')
continue
df_1.groupby(col).size().plot.bar(rot=0,color='g') #plotting Bar chart
ax.set_title(str(col)+' (*categ. '+str(len(df_1[col].value_counts()))+' cat.)')
if len(df_1[col].value_counts())>20:
ax.tick_params(bottom=False,labelbottom=False)
elif plt_type == 'box':
msk = ~np.isnan(df_n.loc[:,col])
ax.boxplot(df_n.loc[msk,col]) #plotting Box plot (used a mask to filter the null values)
#ax.boxplot(df.loc[:,col])
ax.set_title(col)
elif plt_type == 'scat':
ax.scatter(df_n.iloc[:,0],df_n.loc[:,col],s=0.1,alpha=0.1) #plotting scatter plot against the 1st column of the D.F.
ax.set_title(col)
elif plt_type == 'target_scat':
ax.scatter(df_n.loc[:,col],df_n.loc[:,target],s=1,alpha=0.1,color='b') #plotting scatter plot of num. variables against the 'target' column of the D.F.
#ax.set_title(col)
ax.set_xlabel(col)
elif plt_type == 'target_hexbin':
ax = df_n.plot.hexbin(ax=ax,x=col, y=target,gridsize=50, sharex=False) #plotting hexbin plot
ax.set_title(col)
elif (plt_type == 'target_histplot'):
if (df_1[target].dtype not in numerics):
ax.set_title(str(col)+' (*categ. '+str(len(df_1[col].value_counts()))+' cat.)')
if (len(df_1[col].value_counts())<=700):
sns.countplot(ax=ax, data = df_1,x = col,hue = target,alpha=0.7) #plotting countplot of cat. variables against the 'target' column
if len(df_1[col].value_counts())>20:
ax.tick_params(bottom=False,labelbottom=False)
elif (df_1[col].dtype in numerics):
sns.histplot(df_1, x=col, hue=target, element="poly",ax=ax) #plotting histplot of num. variables against the 'target' column
ax.set_title(str(col)+' (*num)')
# ax.plt.legend(fontsize='xx-small', title_fontsize='xx-small')
# plt.setp(ax.get_legend().get_texts(), fontsize='5') # for legend text
# plt.setp(ax.get_legend().get_title(), fontsize='32') # for legend title
elif df_1[target].dtype in numerics:
if (df_1[col].dtype not in numerics):
ax.set_title(str(col)+' (*categ. '+str(len(df_1[col].value_counts()))+' cat.)')
if (len(df_1[col].value_counts())<=5):
sns.histplot(df_1, x=target, hue=col, element="poly",ax=ax) #plotting histplot of num. variables against the 'target' column
elif (df_1[col].dtype in numerics):
ax = df_1.plot.hexbin(ax=ax,x=target, y=col,gridsize=30, sharex=False, cmap="cubehelix_r") #plotting hexbin plot
# continue
plt.tight_layout()
plt.show()
return
*Function cat2num(): to convert categorical variables into serial numeric values in integer format
'''
Custom function to convert categorical variables into numeric form.
This function takes 1 parameter, the input DataFrame, and creates a dictionary for each categorical variable,
where the dictionary keys are the categories sorted alphabetically and the values are serial numeric codes.
The maximum number of categories this function can handle is 7000.
'''
def cat2num(df_1):
numerics = ['uint8','int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df_num_conv=df_1.copy()
for n, col in enumerate(df_num_conv.columns):
if (df_num_conv[col].dtype in numerics) or (len(df_num_conv[col].value_counts())>7000):
continue
else:
cat_dict={y:x for x,y in enumerate(dict(df_num_conv[col].value_counts().sort_index(ascending=True)))} #use the DataFrame passed in, not the global df_raw
df_num_conv[col]=df_num_conv[col].replace(cat_dict)
return df_num_conv
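The core of the mapping that cat2num() builds per column can be illustrated standalone (a toy Series, not churn data): categories are sorted alphabetically and assigned serial integer codes.

```python
# Toy illustration of the per-column mapping idea behind cat2num():
# categories sorted alphabetically, then assigned serial integer codes.
import pandas as pd

s = pd.Series(['Yes', 'No', 'Yes', 'No Phone'])
cat_dict = {cat: i for i, cat in enumerate(sorted(s.unique()))}
print(cat_dict)                   # {'No': 0, 'No Phone': 1, 'Yes': 2}
print(s.map(cat_dict).tolist())   # [2, 0, 2, 1]
```

Note that these codes impose an arbitrary alphabetical order; for nominal variables with more than two levels, one-hot (dummy) encoding is usually safer in a regression model.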
#Converting categories to numeric codes so the categorical variables can be included in correlations
df_num=cat2num(df_raw)
%%time
#calling the plt_summary to the Raw DataFrame
plt_summary(df_raw,"hist",target)
Number of columns: 50
Wall time: 37.3 s
%%time
#calling the plt_summary to the Raw DataFrame
plt_summary(df_raw,"target_histplot",target)
Number of columns: 50
Wall time: 37.2 s
%%time
#calling the plt_summary to the num DataFrame
plt_summary(df_num,'target_scat',target)
Number of Numeric columns: 47
Wall time: 14.6 s
*Function plot_corr_ellipses() imported and modified from the textbook "Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python" (Bruce et al., 2020) and its associated GitHub code repository
'''
From the textbook "Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python"
and its associated GitHub code repository
'''
from matplotlib.collections import EllipseCollection
from matplotlib.colors import Normalize
def plot_corr_ellipses(data, figsize=None, **kwargs):
''' https://stackoverflow.com/a/34558488 '''
M = np.array(data)
if not M.ndim == 2:
raise ValueError('data must be a 2D array')
fig, ax = plt.subplots(1, 1, figsize=figsize, subplot_kw={'aspect':'equal'})
ax.set_xlim(-0.5, M.shape[1] - 0.5)
ax.set_ylim(-0.5, M.shape[0] - 0.5)
ax.invert_yaxis()
# xy locations of each ellipse center
xy = np.indices(M.shape)[::-1].reshape(2, -1).T
# set the relative sizes of the major/minor axes according to the strength of
# the positive/negative correlation
w = np.ones_like(M).ravel() + 0.01
h = 1 - np.abs(M).ravel() - 0.01
a = 45 * np.sign(M).ravel()
ec = EllipseCollection(widths=w, heights=h, angles=a, units='x', offsets=xy,
norm=Normalize(vmin=-1, vmax=1),
transOffset=ax.transData, array=M.ravel(), **kwargs)
ax.add_collection(ec)
# if data is a DataFrame, use the row/column names as tick labels
if isinstance(data, pd.DataFrame):
ax.set_xticks(np.arange(M.shape[1]))
ax.set_xticklabels(data.columns, rotation=90)
ax.tick_params(axis="x", bottom=True, top=True, labelbottom=True, labeltop=True)
ax.set_yticks(np.arange(M.shape[0]))
ax.set_yticklabels(data.index)
ax.grid(True, color="grey", linewidth=0.5, linestyle="-")
return ec
ec = plot_corr_ellipses(df_num.corr(), figsize=(12, 12), cmap='bwr_r') #the returned EllipseCollection is a ScalarMappable
plt.colorbar(ec)
plt.tight_layout()
plt.show()
#Principal Component Analysis
df_numpca =df_num.copy()
#Dropping target column
df_numpca.drop(columns=target,inplace=True)
#Dropping irrelevant columns
drop_cols = ['CaseOrder','City','State','County','Zip','Lng','Lat','Population','Area','TimeZone']
df_numpca.drop(columns=drop_cols,inplace=True)
numerics = ['uint8','int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df_numpca = df_numpca.select_dtypes(include=numerics)
#Principal Component Analysis and plot
df_numpca_normalized=(df_numpca-df_numpca.mean())/df_numpca.std()
pca= PCA(n_components=df_numpca.shape[1])
pca.fit(df_numpca_normalized)
pca_names=['PC'+str(i) for i in range (1,len(df_numpca.columns)+1)]
df_numpca_pca=pd.DataFrame(pca.transform(df_numpca_normalized), columns = [*pca_names])
plt.plot(pca.explained_variance_ratio_,'o-')
plt.title('PCA graph')
plt.xlabel('Components')
plt.ylabel('Explained variance')
plt.ylim([-0.01,0.16])
#plt.xlim([-1,40])
plt.show()
#Eigenvalues and plot
cov_matrix= np.dot(df_numpca_normalized.T,df_numpca_normalized)/df_numpca.shape[0]
eigenvalues =[np.dot(eigenvector.T, np.dot(cov_matrix, eigenvector)) for eigenvector in pca.components_]
plt.plot(eigenvalues,'o-')
plt.xlabel('number of components')
plt.ylabel('eigenvalue')
plt.show()
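To decide how many components to retain from plots like the two above, two common criteria are the Kaiser rule (keep eigenvalues above 1) and a cumulative explained-variance threshold. A sketch on synthetic standardized data standing in for df_numpca_normalized:

```python
# Sketch: two common criteria for how many principal components to keep,
# on synthetic standardized data standing in for df_numpca_normalized.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
Z = rng.normal(size=(1000, 8))
Z[:, 1] = 0.8 * Z[:, 0] + 0.2 * Z[:, 1]   # inject some shared variance
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

pca = PCA().fit(Z)
n_kaiser = int((pca.explained_variance_ > 1).sum())   # Kaiser rule
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_90 = int(np.searchsorted(cumvar, 0.90) + 1)         # 90% cumulative variance
print(n_kaiser, n_90)
```

The two rules can disagree; the Kaiser rule tends to keep fewer components than a 90% variance threshold on weakly correlated data like this.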
#Loadings
loadings = pd.DataFrame(pca.components_.T,columns = [*pca_names],index=df_numpca.columns)
pca_xplots = [(0,1),(0,2),(0,3),(0,4),(1,2),(1,3),(1,4),(2,3),(2,4),(3,4)]
for j,k in (pca_xplots):
plt.figure(figsize=(5,5))
plt.scatter(df_numpca_pca.iloc[:,j],df_numpca_pca.iloc[:,k],s=10,alpha=0.4,c=df_num.loc[:,target],cmap='plasma')
plt.colorbar();plt.grid('on')
plt.title(pca_names[j] +" Vs "+ pca_names[k]+" - Colored by " + target );plt.xlabel(pca_names[j]);plt.ylabel(pca_names[k])
plt.ylim([-6,6]);plt.xlim([-4,4])
plt.show()
PCsortingcomponent=3
print('Sorted by: '+ pca_names[PCsortingcomponent-1])
display(loadings.abs().sort_values(by=pca_names[PCsortingcomponent-1],ascending =False))
print(pca_names[PCsortingcomponent-1] +' Main contributors sorted: \n\n',loadings.iloc[:,PCsortingcomponent-1].abs().sort_values(ascending =False))
#display(loadings.abs().sort_values(by=pca_names,ascending =False))
Sorted by: PC3
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 | PC13 | PC14 | PC15 | PC16 | PC17 | PC18 | PC19 | PC20 | PC21 | PC22 | PC23 | PC24 | PC25 | PC26 | PC27 | PC28 | PC29 | PC30 | PC31 | PC32 | PC33 | PC34 | PC35 | PC36 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MonthlyCharge | 0.001285 | 0.364382 | 0.602709 | 0.029649 | 0.007048 | 0.003271 | 0.027586 | 0.008492 | 0.010440 | 0.020245 | 0.019993 | 0.005582 | 0.001401 | 2.025766e-03 | 0.014034 | 0.000464 | 0.001448 | 0.022201 | 0.008135 | 0.005279 | 0.016415 | 0.000929 | 0.001805 | 0.011384 | 0.005252 | 0.006612 | 0.005122 | 0.001604 | 0.001681 | 0.002346 | 0.002747 | 0.000918 | 0.001311 | 0.001773 | 0.704294 | 0.062417 |
| StreamingMovies | 0.008058 | 0.237248 | 0.394214 | 0.017016 | 0.050453 | 0.106949 | 0.037212 | 0.325792 | 0.224559 | 0.186272 | 0.030097 | 0.015183 | 0.125978 | 1.685546e-01 | 0.092150 | 0.015744 | 0.112597 | 0.053518 | 0.335520 | 0.080264 | 0.121558 | 0.235488 | 0.078344 | 0.011521 | 0.249401 | 0.064931 | 0.092823 | 0.162884 | 0.010067 | 0.003142 | 0.012451 | 0.007697 | 0.007027 | 0.005016 | 0.457137 | 0.071878 |
| Tenure | 0.016285 | 0.584647 | 0.384277 | 0.067919 | 0.006886 | 0.028829 | 0.012427 | 0.001135 | 0.037304 | 0.001698 | 0.013335 | 0.049343 | 0.005051 | 2.449841e-02 | 0.021186 | 0.013863 | 0.037436 | 0.017515 | 0.010697 | 0.000651 | 0.006930 | 0.005999 | 0.007323 | 0.040677 | 0.017870 | 0.028046 | 0.006480 | 0.012467 | 0.016946 | 0.003586 | 0.007405 | 0.004930 | 0.005731 | 0.005041 | 0.091396 | 0.696867 |
| Bandwidth_GB_Year | 0.016954 | 0.613473 | 0.339510 | 0.064940 | 0.002419 | 0.003731 | 0.007333 | 0.012940 | 0.010799 | 0.016610 | 0.014139 | 0.007892 | 0.005589 | 6.246236e-03 | 0.005376 | 0.004548 | 0.015013 | 0.024140 | 0.015407 | 0.003329 | 0.012294 | 0.002515 | 0.015492 | 0.022509 | 0.005226 | 0.002479 | 0.000420 | 0.008075 | 0.008588 | 0.002884 | 0.004335 | 0.005958 | 0.001522 | 0.002809 | 0.085169 | 0.702427 |
| StreamingTV | 0.001095 | 0.198275 | 0.300744 | 0.009224 | 0.029070 | 0.080746 | 0.014053 | 0.208239 | 0.041458 | 0.236647 | 0.212901 | 0.164303 | 0.399752 | 4.143699e-01 | 0.100903 | 0.056866 | 0.055193 | 0.175299 | 0.019220 | 0.082895 | 0.027893 | 0.275454 | 0.055770 | 0.047851 | 0.204944 | 0.172203 | 0.036806 | 0.082011 | 0.031034 | 0.033446 | 0.010367 | 0.003003 | 0.026150 | 0.008086 | 0.366009 | 0.067444 |
| Multiple | 0.000908 | 0.135973 | 0.263199 | 0.026448 | 0.085019 | 0.023985 | 0.127042 | 0.238867 | 0.294987 | 0.241846 | 0.079337 | 0.048894 | 0.343688 | 3.465556e-01 | 0.092878 | 0.047747 | 0.095302 | 0.150578 | 0.175750 | 0.061256 | 0.075997 | 0.015780 | 0.169579 | 0.085792 | 0.025621 | 0.251608 | 0.308888 | 0.264158 | 0.049784 | 0.040545 | 0.018372 | 0.026341 | 0.013148 | 0.020575 | 0.287312 | 0.035020 |
| OnlineBackup | 0.005435 | 0.126439 | 0.147357 | 0.026541 | 0.171265 | 0.172518 | 0.136405 | 0.104502 | 0.084772 | 0.089049 | 0.374386 | 0.060453 | 0.059879 | 5.269687e-02 | 0.418059 | 0.118995 | 0.095956 | 0.226102 | 0.421092 | 0.323466 | 0.000369 | 0.164447 | 0.053488 | 0.254729 | 0.037556 | 0.169588 | 0.089025 | 0.050781 | 0.002498 | 0.028809 | 0.004051 | 0.011703 | 0.017251 | 0.012148 | 0.196379 | 0.031164 |
| DeviceProtection | 0.002992 | 0.043178 | 0.138106 | 0.026027 | 0.196919 | 0.210258 | 0.266805 | 0.031075 | 0.171872 | 0.358288 | 0.167195 | 0.124920 | 0.030528 | 1.102958e-02 | 0.199079 | 0.069690 | 0.237362 | 0.184814 | 0.109487 | 0.112244 | 0.328262 | 0.106763 | 0.022483 | 0.146565 | 0.324603 | 0.032534 | 0.380642 | 0.230607 | 0.036037 | 0.035076 | 0.022053 | 0.006734 | 0.020150 | 0.000837 | 0.107368 | 0.022680 |
| TechSupport | 0.024689 | 0.037298 | 0.070719 | 0.004073 | 0.130928 | 0.138021 | 0.141013 | 0.249824 | 0.239382 | 0.059862 | 0.335770 | 0.346142 | 0.236954 | 5.427333e-02 | 0.222510 | 0.167514 | 0.160677 | 0.031702 | 0.112471 | 0.288945 | 0.338526 | 0.120777 | 0.130287 | 0.190587 | 0.189270 | 0.130858 | 0.093199 | 0.253602 | 0.037482 | 0.010842 | 0.029145 | 0.003353 | 0.003343 | 0.011065 | 0.101879 | 0.009841 |
| Item5 | 0.175623 | 0.037677 | 0.054942 | 0.583769 | 0.044271 | 0.014975 | 0.001318 | 0.014061 | 0.002230 | 0.006553 | 0.017528 | 0.005516 | 0.005571 | 5.641963e-03 | 0.029841 | 0.006291 | 0.032988 | 0.000607 | 0.000460 | 0.006193 | 0.003735 | 0.030391 | 0.030530 | 0.009909 | 0.007575 | 0.033785 | 0.007436 | 0.011878 | 0.139698 | 0.061181 | 0.197694 | 0.697566 | 0.254665 | 0.043372 | 0.001922 | 0.000235 |
| Item4 | 0.145862 | 0.025952 | 0.039026 | 0.566678 | 0.012808 | 0.012393 | 0.011625 | 0.005350 | 0.000447 | 0.017531 | 0.011741 | 0.000174 | 0.010546 | 3.167788e-02 | 0.005627 | 0.022242 | 0.015278 | 0.003420 | 0.013022 | 0.013345 | 0.042019 | 0.014795 | 0.015553 | 0.008952 | 0.030660 | 0.019830 | 0.056931 | 0.005384 | 0.168543 | 0.475473 | 0.437354 | 0.439224 | 0.079101 | 0.019971 | 0.000417 | 0.000063 |
| Item3 | 0.400198 | 0.023213 | 0.032689 | 0.278817 | 0.016708 | 0.020223 | 0.002148 | 0.023312 | 0.003391 | 0.000432 | 0.011656 | 0.032634 | 0.003768 | 1.298912e-02 | 0.020587 | 0.010565 | 0.000896 | 0.038099 | 0.030096 | 0.011887 | 0.025051 | 0.007623 | 0.026769 | 0.005598 | 0.017320 | 0.027441 | 0.006174 | 0.005723 | 0.169994 | 0.252716 | 0.159652 | 0.383482 | 0.675503 | 0.174969 | 0.006563 | 0.000338 |
| OnlineSecurity | 0.000636 | 0.036248 | 0.032530 | 0.011443 | 0.172458 | 0.189093 | 0.133718 | 0.152777 | 0.279610 | 0.152857 | 0.370942 | 0.260268 | 0.116522 | 1.188565e-01 | 0.401894 | 0.098859 | 0.010393 | 0.250912 | 0.105303 | 0.016048 | 0.116814 | 0.083840 | 0.145778 | 0.003089 | 0.166286 | 0.293763 | 0.219989 | 0.319548 | 0.031597 | 0.045354 | 0.008616 | 0.010596 | 0.008083 | 0.003786 | 0.024389 | 0.013404 |
| Outage_sec_perweek | 0.017867 | 0.017278 | 0.029656 | 0.012844 | 0.048228 | 0.185549 | 0.294545 | 0.051296 | 0.037449 | 0.275553 | 0.389688 | 0.166850 | 0.101379 | 6.653331e-02 | 0.242988 | 0.194740 | 0.059392 | 0.087088 | 0.330077 | 0.315283 | 0.262062 | 0.076475 | 0.227236 | 0.122959 | 0.033416 | 0.277637 | 0.160726 | 0.165814 | 0.100967 | 0.015542 | 0.006844 | 0.013641 | 0.018940 | 0.013961 | 0.001159 | 0.000121 |
| Item2 | 0.433652 | 0.030724 | 0.027409 | 0.281293 | 0.022663 | 0.001292 | 0.014046 | 0.031478 | 0.004271 | 0.013383 | 0.006628 | 0.007237 | 0.005965 | 1.618351e-02 | 0.005816 | 0.018875 | 0.004362 | 0.017333 | 0.002773 | 0.000163 | 0.009822 | 0.023196 | 0.004084 | 0.004836 | 0.024631 | 0.009543 | 0.014921 | 0.001068 | 0.116938 | 0.164744 | 0.063407 | 0.065502 | 0.590561 | 0.573237 | 0.005333 | 0.000691 |
| Phone | 0.005255 | 0.006482 | 0.026641 | 0.034715 | 0.168404 | 0.182420 | 0.176766 | 0.399725 | 0.135697 | 0.290586 | 0.100618 | 0.049883 | 0.146202 | 1.499371e-01 | 0.117825 | 0.069794 | 0.344013 | 0.060370 | 0.202200 | 0.112766 | 0.006433 | 0.278676 | 0.212891 | 0.232575 | 0.339406 | 0.210951 | 0.171218 | 0.153308 | 0.021317 | 0.010035 | 0.013995 | 0.009667 | 0.022739 | 0.025996 | 0.004043 | 0.000289 |
| Gender | 0.006606 | 0.000282 | 0.025137 | 0.005140 | 0.163742 | 0.288456 | 0.350118 | 0.100922 | 0.092735 | 0.014567 | 0.002414 | 0.093925 | 0.014775 | 3.083093e-01 | 0.314902 | 0.280188 | 0.175608 | 0.308175 | 0.049535 | 0.104759 | 0.207308 | 0.163313 | 0.050378 | 0.268569 | 0.350537 | 0.108632 | 0.189274 | 0.074078 | 0.016884 | 0.050651 | 0.037992 | 0.038734 | 0.021235 | 0.006307 | 0.006249 | 0.008047 |
| Children | 0.004062 | 0.006996 | 0.022799 | 0.027233 | 0.256007 | 0.357658 | 0.068468 | 0.142938 | 0.174139 | 0.136547 | 0.004814 | 0.293544 | 0.036262 | 9.335520e-02 | 0.283808 | 0.135861 | 0.270739 | 0.331444 | 0.180660 | 0.206500 | 0.067968 | 0.054321 | 0.032644 | 0.164450 | 0.244374 | 0.043262 | 0.360496 | 0.220105 | 0.017029 | 0.013750 | 0.012019 | 0.016564 | 0.020601 | 0.001534 | 0.002981 | 0.021366 |
| PaymentMethod | 0.002029 | 0.005287 | 0.018139 | 0.010812 | 0.209343 | 0.250987 | 0.388707 | 0.014036 | 0.080267 | 0.214929 | 0.126404 | 0.172566 | 0.071181 | 3.435017e-02 | 0.152159 | 0.254798 | 0.259314 | 0.069493 | 0.285499 | 0.104767 | 0.037658 | 0.208248 | 0.425729 | 0.146656 | 0.183318 | 0.149525 | 0.194293 | 0.204256 | 0.005077 | 0.016073 | 0.009213 | 0.002017 | 0.025834 | 0.011181 | 0.001173 | 0.000104 |
| Yearly_equip_failure | 0.007586 | 0.010157 | 0.018031 | 0.005670 | 0.025611 | 0.053246 | 0.049726 | 0.068388 | 0.214054 | 0.317419 | 0.260415 | 0.275090 | 0.380029 | 3.222771e-01 | 0.061476 | 0.161549 | 0.159293 | 0.044798 | 0.065507 | 0.084702 | 0.310644 | 0.273128 | 0.001954 | 0.010990 | 0.252149 | 0.327593 | 0.133839 | 0.142464 | 0.024064 | 0.010316 | 0.016452 | 0.002457 | 0.007314 | 0.020487 | 0.001928 | 0.000273 |
| Item1 | 0.458524 | 0.026273 | 0.018001 | 0.279258 | 0.015098 | 0.002382 | 0.015521 | 0.003372 | 0.008404 | 0.009678 | 0.012817 | 0.000007 | 0.010226 | 1.586049e-02 | 0.015139 | 0.021983 | 0.013283 | 0.006503 | 0.020878 | 0.007836 | 0.021052 | 0.036144 | 0.006383 | 0.005827 | 0.015785 | 0.021993 | 0.005731 | 0.017454 | 0.065534 | 0.112216 | 0.042154 | 0.024720 | 0.241617 | 0.792185 | 0.007132 | 0.001070 |
| Item8 | 0.308510 | 0.005864 | 0.016932 | 0.130530 | 0.035341 | 0.009377 | 0.027214 | 0.030253 | 0.014448 | 0.022980 | 0.011350 | 0.002542 | 0.018398 | 5.557681e-02 | 0.038701 | 0.024235 | 0.015398 | 0.029439 | 0.038344 | 0.000337 | 0.052102 | 0.030194 | 0.084024 | 0.090310 | 0.019159 | 0.130421 | 0.019693 | 0.043855 | 0.904849 | 0.024232 | 0.109246 | 0.044461 | 0.042343 | 0.043953 | 0.002552 | 0.000003 |
| InternetService | 0.004945 | 0.090356 | 0.014500 | 0.008043 | 0.178073 | 0.048197 | 0.154316 | 0.086865 | 0.325788 | 0.069635 | 0.023365 | 0.441653 | 0.032101 | 1.156500e-01 | 0.105920 | 0.262142 | 0.333731 | 0.105942 | 0.077239 | 0.025476 | 0.312410 | 0.211740 | 0.018183 | 0.332911 | 0.143811 | 0.315136 | 0.040776 | 0.043694 | 0.114230 | 0.005625 | 0.005357 | 0.022490 | 0.014675 | 0.010221 | 0.038954 | 0.058191 |
| Age | 0.006543 | 0.005773 | 0.014261 | 0.028575 | 0.314819 | 0.325885 | 0.134757 | 0.208302 | 0.210461 | 0.146269 | 0.089118 | 0.022460 | 0.055719 | 1.052215e-01 | 0.330820 | 0.132331 | 0.241096 | 0.199989 | 0.040361 | 0.101898 | 0.205941 | 0.182387 | 0.282448 | 0.280314 | 0.009027 | 0.033324 | 0.409293 | 0.021801 | 0.051598 | 0.001223 | 0.005929 | 0.006924 | 0.003078 | 0.013807 | 0.003886 | 0.021539 |
| Email | 0.008841 | 0.016370 | 0.014198 | 0.005790 | 0.298975 | 0.099497 | 0.044126 | 0.305300 | 0.290185 | 0.127826 | 0.050999 | 0.129703 | 0.322733 | 2.754287e-01 | 0.081214 | 0.038530 | 0.195444 | 0.064523 | 0.015899 | 0.224922 | 0.304196 | 0.204382 | 0.065352 | 0.094016 | 0.192573 | 0.337172 | 0.099768 | 0.303420 | 0.050271 | 0.018479 | 0.017296 | 0.007168 | 0.016746 | 0.001281 | 0.000764 | 0.000044 |
| Job | 0.001041 | 0.003845 | 0.013714 | 0.030533 | 0.326489 | 0.116705 | 0.027257 | 0.150627 | 0.208334 | 0.020077 | 0.163166 | 0.217466 | 0.161698 | 2.570319e-01 | 0.084639 | 0.307621 | 0.003381 | 0.381404 | 0.038929 | 0.123943 | 0.157232 | 0.440735 | 0.062775 | 0.292292 | 0.192048 | 0.059069 | 0.004861 | 0.172753 | 0.006745 | 0.042787 | 0.022446 | 0.010596 | 0.000615 | 0.006453 | 0.003892 | 0.000126 |
| Item6 | 0.404580 | 0.000825 | 0.013588 | 0.182438 | 0.001838 | 0.004540 | 0.014724 | 0.013394 | 0.013183 | 0.031090 | 0.011469 | 0.006341 | 0.007754 | 4.659514e-03 | 0.021548 | 0.033830 | 0.007063 | 0.011197 | 0.005965 | 0.021638 | 0.004587 | 0.031776 | 0.020607 | 0.003433 | 0.003579 | 0.012448 | 0.012066 | 0.019104 | 0.064651 | 0.062096 | 0.759840 | 0.393196 | 0.228714 | 0.065517 | 0.000357 | 0.000281 |
| Techie | 0.008068 | 0.007134 | 0.013185 | 0.023330 | 0.177039 | 0.350099 | 0.081050 | 0.060867 | 0.005356 | 0.028855 | 0.247121 | 0.061875 | 0.302280 | 3.610458e-07 | 0.262829 | 0.316309 | 0.129207 | 0.371535 | 0.208468 | 0.142494 | 0.241540 | 0.127340 | 0.287590 | 0.089082 | 0.004429 | 0.122128 | 0.254487 | 0.208787 | 0.040807 | 0.006284 | 0.001597 | 0.003068 | 0.000914 | 0.001400 | 0.003293 | 0.000381 |
| Tablet | 0.016426 | 0.009390 | 0.010142 | 0.000230 | 0.192382 | 0.205423 | 0.145711 | 0.446454 | 0.147720 | 0.170558 | 0.176285 | 0.006110 | 0.226207 | 1.859874e-01 | 0.077491 | 0.240589 | 0.081161 | 0.030748 | 0.086121 | 0.208545 | 0.298558 | 0.061074 | 0.200186 | 0.141464 | 0.196082 | 0.331028 | 0.040017 | 0.324029 | 0.011695 | 0.010138 | 0.004644 | 0.030268 | 0.028589 | 0.008808 | 0.003687 | 0.000150 |
| Marital | 0.001545 | 0.000925 | 0.007686 | 0.046163 | 0.347636 | 0.172469 | 0.138160 | 0.144941 | 0.095676 | 0.207509 | 0.053805 | 0.117624 | 0.232683 | 1.672024e-01 | 0.009986 | 0.386828 | 0.224922 | 0.148864 | 0.056956 | 0.174930 | 0.084927 | 0.324999 | 0.368358 | 0.310863 | 0.106104 | 0.164987 | 0.060333 | 0.040828 | 0.097383 | 0.004525 | 0.013683 | 0.025600 | 0.019096 | 0.009077 | 0.000276 | 0.000269 |
| Income | 0.001293 | 0.003168 | 0.007672 | 0.025195 | 0.044314 | 0.094008 | 0.398118 | 0.003895 | 0.052135 | 0.246179 | 0.262052 | 0.355212 | 0.007184 | 1.616457e-02 | 0.010952 | 0.360848 | 0.051042 | 0.265198 | 0.222417 | 0.141875 | 0.143951 | 0.166202 | 0.225633 | 0.143254 | 0.301995 | 0.040970 | 0.086480 | 0.266116 | 0.018776 | 0.083975 | 0.008254 | 0.000438 | 0.004426 | 0.013636 | 0.001500 | 0.000309 |
| Contract | 0.001566 | 0.028128 | 0.005922 | 0.019875 | 0.156791 | 0.234927 | 0.197011 | 0.082019 | 0.245649 | 0.232078 | 0.141509 | 0.251632 | 0.203228 | 3.704564e-01 | 0.206058 | 0.057373 | 0.041128 | 0.193069 | 0.111547 | 0.029406 | 0.030773 | 0.100707 | 0.478787 | 0.090675 | 0.229960 | 0.085417 | 0.284625 | 0.065068 | 0.066232 | 0.016827 | 0.005478 | 0.017629 | 0.019097 | 0.001101 | 0.001354 | 0.000169 |
| Port_modem | 0.000960 | 0.010358 | 0.004830 | 0.014175 | 0.146567 | 0.255593 | 0.166854 | 0.025815 | 0.272646 | 0.023889 | 0.226382 | 0.010529 | 0.068306 | 5.111927e-02 | 0.041160 | 0.238422 | 0.195845 | 0.056817 | 0.492131 | 0.588980 | 0.016904 | 0.166462 | 0.014630 | 0.014800 | 0.141252 | 0.040310 | 0.099861 | 0.035562 | 0.003238 | 0.007392 | 0.004799 | 0.013782 | 0.008367 | 0.001230 | 0.001215 | 0.000298 |
| Contacts | 0.009033 | 0.007482 | 0.004506 | 0.010949 | 0.216367 | 0.195649 | 0.296832 | 0.271232 | 0.252826 | 0.152608 | 0.068084 | 0.109572 | 0.204284 | 4.976584e-02 | 0.075945 | 0.096895 | 0.425239 | 0.303393 | 0.019539 | 0.125046 | 0.201635 | 0.055289 | 0.038369 | 0.301196 | 0.083898 | 0.105951 | 0.119369 | 0.356904 | 0.052216 | 0.037752 | 0.002309 | 0.022767 | 0.019084 | 0.001903 | 0.000528 | 0.000159 |
| Item7 | 0.358079 | 0.001028 | 0.001613 | 0.179957 | 0.004528 | 0.016178 | 0.023696 | 0.008354 | 0.023578 | 0.014948 | 0.000795 | 0.018527 | 0.023550 | 6.017067e-02 | 0.061499 | 0.022301 | 0.003932 | 0.049531 | 0.030897 | 0.002225 | 0.011965 | 0.004971 | 0.010907 | 0.061496 | 0.048440 | 0.018215 | 0.025681 | 0.002814 | 0.157963 | 0.799398 | 0.377580 | 0.073599 | 0.066508 | 0.040740 | 0.000834 | 0.000126 |
| PaperlessBilling | 0.005597 | 0.003479 | 0.001231 | 0.007867 | 0.302491 | 0.108702 | 0.219093 | 0.151260 | 0.254162 | 0.322588 | 0.063646 | 0.223036 | 0.170612 | 2.085504e-01 | 0.004918 | 0.114146 | 0.237857 | 0.009438 | 0.005464 | 0.205505 | 0.218110 | 0.253008 | 0.010926 | 0.367496 | 0.089512 | 0.303847 | 0.203473 | 0.162980 | 0.028953 | 0.037742 | 0.031559 | 0.003987 | 0.014685 | 0.008927 | 0.005315 | 0.000077 |
PC3 Main contributors sorted:
MonthlyCharge           0.602709
StreamingMovies         0.394214
Tenure                  0.384277
Bandwidth_GB_Year       0.339510
StreamingTV             0.300744
Multiple                0.263199
OnlineBackup            0.147357
DeviceProtection        0.138106
TechSupport             0.070719
Item5                   0.054942
Item4                   0.039026
Item3                   0.032689
OnlineSecurity          0.032530
Outage_sec_perweek      0.029656
Item2                   0.027409
Phone                   0.026641
Gender                  0.025137
Children                0.022799
PaymentMethod           0.018139
Yearly_equip_failure    0.018031
Item1                   0.018001
Item8                   0.016932
InternetService         0.014500
Age                     0.014261
Email                   0.014198
Job                     0.013714
Item6                   0.013588
Techie                  0.013185
Tablet                  0.010142
Marital                 0.007686
Income                  0.007672
Contract                0.005922
Port_modem              0.004830
Contacts                0.004506
Item7                   0.001613
PaperlessBilling        0.001231
Name: PC3, dtype: float64
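The sorted "main contributors" listings above can be produced directly from the absolute values of the PCA loadings. A minimal sketch of the idea (the data and column names here are illustrative, not the actual churn set):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# toy numeric data standing in for the prepared churn features
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)),
                  columns=['Tenure', 'MonthlyCharge', 'Age', 'Income'])

pca = PCA(n_components=3).fit(df)

# loadings matrix: one row per component, one column per original feature
loadings = pd.DataFrame(np.abs(pca.components_),
                        index=[f'PC{i + 1}' for i in range(3)],
                        columns=df.columns)

# rank features by their absolute contribution to PC3
pc3_sorted = loadings.loc['PC3'].sort_values(ascending=False)
print(pc3_sorted)
```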
Data preparation goals:
# selecting only the numeric columns
numerics = ['uint8','uint32','int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df_num_only = df_raw.select_dtypes(include=numerics) # selecting only the numeric columns
lr_df=df_num_only
Initializing Test model
X=lr_df.copy()
# adding categorical columns excluded earlier
catcols=[coln for coln in list(df_raw.columns) if coln not in list(X.columns)]
X[catcols]=df_raw.loc[:,catcols]
X=cat2num(X) # To be able to correlate the categorical variables
X= X.select_dtypes(include=numerics) # removing the non-numeric columns
X=X.drop([target],axis=1) #dropping the target column from the predictors data set
X = sm.add_constant(X) #adding constant that will represent the intercept of the LR model
y = df_num[target] # assigning the target to vector y
# Splitting the data into train and test portions
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.20)
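A later note in this notebook observes that coefficient values shift slightly whenever the models are regenerated; that drift comes from the unseeded split above. Passing `random_state` (and optionally `stratify`) to `train_test_split` makes the split, and therefore the fitted model, reproducible — a sketch on toy data (names illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X_demo = pd.DataFrame({'f1': rng.normal(size=100), 'f2': rng.normal(size=100)})
y_demo = pd.Series(rng.integers(0, 2, size=100))

# fixed seed -> identical split on every rerun; stratify keeps the class balance
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo)
print(len(X_tr), len(X_te))  # 80 20
```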
%%time
# LR implementation
logistic_regression = sm.Logit(y_train,X_train)
fitted_model = logistic_regression.fit()
predictions =fitted_model.predict(X_test)
print(fitted_model.summary())
Optimization terminated successfully.
Current function value: 0.225160
Iterations 9
Logit Regression Results
==============================================================================
Dep. Variable: Churn No. Observations: 8000
Model: Logit Df Residuals: 7953
Method: MLE Df Model: 46
Date: Thu, 15 Apr 2021 Pseudo R-squ.: 0.6130
Time: 13:12:07 Log-Likelihood: -1801.3
converged: True LL-Null: -4654.1
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------
const -7.4082 0.889 -8.336 0.000 -9.150 -5.666
CaseOrder -1.701e-06 2.68e-05 -0.063 0.949 -5.42e-05 5.08e-05
Zip -1.95e-06 4.58e-06 -0.426 0.670 -1.09e-05 7.02e-06
Lat 0.0020 0.009 0.224 0.822 -0.015 0.019
Lng -0.0043 0.007 -0.575 0.565 -0.019 0.010
Population -1.986e-07 3.14e-06 -0.063 0.950 -6.36e-06 5.96e-06
Children -0.1847 0.027 -6.929 0.000 -0.237 -0.132
Age 0.0235 0.003 8.490 0.000 0.018 0.029
Income 6.017e-07 1.5e-06 0.400 0.689 -2.34e-06 3.55e-06
Outage_sec_perweek 0.0061 0.014 0.431 0.667 -0.022 0.034
Email -0.0009 0.014 -0.068 0.946 -0.028 0.026
Contacts 0.0845 0.043 1.965 0.049 0.000 0.169
Yearly_equip_failure -0.0672 0.068 -0.989 0.322 -0.200 0.066
Tenure -0.6332 0.047 -13.488 0.000 -0.725 -0.541
MonthlyCharge 0.0259 0.004 6.686 0.000 0.018 0.034
Bandwidth_GB_Year 0.0063 0.001 11.379 0.000 0.005 0.007
Item1 0.0031 0.060 0.052 0.959 -0.115 0.121
Item2 0.0035 0.057 0.061 0.951 -0.108 0.115
Item3 -0.0283 0.052 -0.549 0.583 -0.129 0.073
Item4 -0.0029 0.046 -0.064 0.949 -0.093 0.088
Item5 -0.0351 0.049 -0.718 0.473 -0.131 0.061
Item6 -0.0232 0.050 -0.467 0.641 -0.121 0.074
Item7 0.0205 0.047 0.434 0.664 -0.072 0.113
Item8 -0.0439 0.044 -0.990 0.322 -0.131 0.043
City -2.484e-06 2.41e-05 -0.103 0.918 -4.98e-05 4.48e-05
State 0.0043 0.003 1.403 0.161 -0.002 0.010
County -8.754e-06 9.46e-05 -0.093 0.926 -0.000 0.000
Area -0.0511 0.052 -0.979 0.328 -0.153 0.051
TimeZone 0.0051 0.008 0.617 0.537 -0.011 0.021
Job -0.0005 0.000 -2.292 0.022 -0.001 -7.76e-05
Marital 0.0561 0.030 1.876 0.061 -0.003 0.115
Gender -0.1478 0.083 -1.787 0.074 -0.310 0.014
Techie 1.0169 0.111 9.173 0.000 0.800 1.234
Contract -1.9386 0.072 -26.785 0.000 -2.080 -1.797
Port_modem 0.1054 0.085 1.243 0.214 -0.061 0.272
Tablet -0.1198 0.093 -1.293 0.196 -0.301 0.062
InternetService 0.7804 0.149 5.252 0.000 0.489 1.072
Phone -0.4181 0.146 -2.856 0.004 -0.705 -0.131
Multiple 0.2608 0.177 1.471 0.141 -0.087 0.608
OnlineSecurity -0.6861 0.102 -6.723 0.000 -0.886 -0.486
OnlineBackup -0.3967 0.151 -2.627 0.009 -0.693 -0.101
DeviceProtection -0.5358 0.123 -4.371 0.000 -0.776 -0.296
TechSupport -0.1315 0.102 -1.295 0.195 -0.331 0.068
StreamingTV 0.2410 0.279 0.864 0.388 -0.306 0.788
StreamingMovies 0.6366 0.305 2.089 0.037 0.039 1.234
PaperlessBilling 0.1556 0.087 1.797 0.072 -0.014 0.325
PaymentMethod 0.1270 0.040 3.202 0.001 0.049 0.205
========================================================================================
Wall time: 271 ms
print(confusion_matrix(y_test, predictions>0.5))
[[1396  102]
 [ 124  378]]
modelnumber=0
report=''
report=str("model "+str(modelnumber))+ classification_report(y_test,(predictions >=0.5).astype('int'))+'\n'+("-"*100)+'\n'+report
print(report)
# LR examination plots
sns.countplot(x = (fitted_model.predict(X_test)>=0.5).astype('int'),hue = y_test,alpha=0.7)
plt.xlabel('prediction')
plt.show()
residuals=((fitted_model.predict(X_test)>=0.5).astype('int'))- y_test
residuals.hist(bins = 5)
plt.title('Residuals')
plt.xlabel('residual(std.)')
plt.show()
plt.scatter(X_test.index,residuals,marker='o',s=20,alpha=0.7,c=y_test,cmap=sns.diverging_palette(250, 30, l=65, center="dark", as_cmap=True)) # alpha must be in [0, 1]
plt.title('Residuals positions (colored by "'+ target+'")')
plt.xlabel('index')
plt.ylabel('residuals')
plt.colorbar();plt.grid('on')
plt.show()
model 0 precision recall f1-score support
0 0.92 0.93 0.93 1498
1 0.79 0.75 0.77 502
accuracy 0.89 2000
macro avg 0.85 0.84 0.85 2000
weighted avg 0.89 0.89 0.89 2000
----------------------------------------------------------------------------------------------------
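The report figures above follow directly from the printed confusion matrix `[[1396 102] [124 378]]`; for class 1, precision, recall, and overall accuracy can be recomputed by hand:

```python
# counts taken from the confusion matrix printed above (model 0)
tn, fp, fn, tp = 1396, 102, 124, 378

precision = tp / (tp + fp)                   # 378 / 480
recall = tp / (tp + fn)                      # 378 / 502
accuracy = (tn + tp) / (tn + fp + fn + tp)   # 1774 / 2000

# approximately 0.79, 0.75, and 0.89, matching the model 0 report
print(precision, recall, accuracy)
```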
fitted_model.pvalues.sort_values(ascending=True)
Contract                4.810898e-158
Tenure                  1.844850e-41
Bandwidth_GB_Year       5.311655e-30
Techie                  4.584947e-20
Age                     2.075125e-17
const                   7.698112e-17
Children                4.243508e-12
OnlineSecurity          1.785841e-11
MonthlyCharge           2.287992e-11
InternetService         1.501303e-07
DeviceProtection        1.236988e-05
PaymentMethod           1.365064e-03
Phone                   4.293930e-03
OnlineBackup            8.608209e-03
Job                     2.192389e-02
StreamingMovies         3.666625e-02
Contacts                4.937141e-02
Marital                 6.062221e-02
PaperlessBilling        7.227342e-02
Gender                  7.387787e-02
Multiple                1.411925e-01
State                   1.606102e-01
TechSupport             1.952879e-01
Tablet                  1.960587e-01
Port_modem              2.138644e-01
Item8                   3.222154e-01
Yearly_equip_failure    3.224501e-01
Area                    3.275020e-01
StreamingTV             3.877089e-01
Item5                   4.725855e-01
TimeZone                5.369447e-01
Lng                     5.649780e-01
Item3                   5.827279e-01
Item6                   6.408403e-01
Item7                   6.640254e-01
Outage_sec_perweek      6.667268e-01
Zip                     6.700882e-01
Income                  6.888299e-01
Lat                     8.224973e-01
City                    9.180030e-01
County                  9.262980e-01
Email                   9.460898e-01
Item4                   9.490846e-01
CaseOrder               9.493778e-01
Population              9.495821e-01
Item2                   9.509732e-01
Item1                   9.585714e-01
dtype: float64
signific=0.05
test_significant_columns = [fitted_model.pvalues.index[key] for (key,value) in enumerate(fitted_model.pvalues) if value <signific]
test_nonsign_columns = [fitted_model.pvalues.index[key] for (key,value) in enumerate(fitted_model.pvalues) if value >=signific]
print('test Significant columns: \n',test_significant_columns)
print('\n test Non Significant columns: \n',test_nonsign_columns)
test Significant columns:
['const', 'Children', 'Age', 'Contacts', 'Tenure', 'MonthlyCharge', 'Bandwidth_GB_Year', 'Job', 'Techie', 'Contract', 'InternetService', 'Phone', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'StreamingMovies', 'PaymentMethod']

test Non Significant columns:
['CaseOrder', 'Zip', 'Lat', 'Lng', 'Population', 'Income', 'Outage_sec_perweek', 'Email', 'Yearly_equip_failure', 'Item1', 'Item2', 'Item3', 'Item4', 'Item5', 'Item6', 'Item7', 'Item8', 'City', 'State', 'County', 'Area', 'TimeZone', 'Marital', 'Gender', 'Port_modem', 'Tablet', 'Multiple', 'TechSupport', 'StreamingTV', 'PaperlessBilling']
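Since `fitted_model.pvalues` is a pandas Series, the positional list comprehensions above can also be written with boolean indexing, which reads more directly. A self-contained sketch with a toy Series (values borrowed from the output above):

```python
import pandas as pd

# toy Series standing in for fitted_model.pvalues
pvalues = pd.Series({'const': 7.7e-17, 'Tenure': 1.8e-41,
                     'Income': 0.689, 'Gender': 0.074})
signific = 0.05

significant = pvalues[pvalues < signific].index.tolist()
nonsignificant = pvalues[pvalues >= signific].index.tolist()
print(significant)     # ['const', 'Tenure']
print(nonsignificant)  # ['Income', 'Gender']
```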
X_temp=X.copy()
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X_temp.values, i) for i in range(X_temp.values.shape[1])]
vif["features"] = X_temp.columns
print(vif.round(1).sort_values(by="VIF Factor",ascending=False))
    VIF Factor              features
15       722.5     Bandwidth_GB_Year
13       712.7                Tenure
0        449.4                 const
14        13.2         MonthlyCharge
44        11.3       StreamingMovies
43         9.4           StreamingTV
2          9.4                   Zip
4          7.9                   Lng
36         6.0       InternetService
38         3.9              Multiple
1          3.3             CaseOrder
40         2.9          OnlineBackup
16         2.2                 Item1
17         1.9                 Item2
41         1.9      DeviceProtection
28         1.8              TimeZone
7          1.7                   Age
6          1.7              Children
18         1.6                 Item3
21         1.5                 Item6
20         1.4                 Item5
22         1.3                 Item7
42         1.3           TechSupport
19         1.3                 Item4
39         1.3        OnlineSecurity
3          1.3                   Lat
23         1.2                 Item8
5          1.1            Population
31         1.1                Gender
25         1.1                 State
27         1.0                  Area
24         1.0                  City
45         1.0      PaperlessBilling
8          1.0                Income
9          1.0    Outage_sec_perweek
10         1.0                 Email
11         1.0              Contacts
12         1.0  Yearly_equip_failure
37         1.0                 Phone
26         1.0                County
35         1.0                Tablet
34         1.0            Port_modem
33         1.0              Contract
32         1.0                Techie
30         1.0               Marital
29         1.0                   Job
46         1.0         PaymentMethod
X_temp=X.copy()
droplist=['Age','Tenure','Job','State','PaperlessBilling','Phone','OnlineSecurity','DeviceProtection','Port_modem']# list of features to remove based on VIF and modelling iterations
X_temp=X_temp.drop(droplist,axis=1)
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X_temp.values, i) for i in range(X_temp.values.shape[1])]
vif["features"] = X_temp.columns
print(vif.round(1).sort_values(by="VIF Factor",ascending=False))
    VIF Factor              features
0        292.8                 const
2          9.3                   Zip
4          7.6                   Lng
12         6.5         MonthlyCharge
36         3.5       StreamingMovies
13         3.3     Bandwidth_GB_Year
1          3.2             CaseOrder
35         2.6           StreamingTV
14         2.2                 Item1
32         2.0              Multiple
15         1.9                 Item2
25         1.8              TimeZone
16         1.6                 Item3
33         1.5          OnlineBackup
19         1.5                 Item6
18         1.4                 Item5
17         1.3                 Item4
3          1.3                   Lat
20         1.3                 Item7
21         1.2                 Item8
5          1.1            Population
34         1.1           TechSupport
31         1.1       InternetService
11         1.0  Yearly_equip_failure
29         1.0              Contract
6          1.0              Children
7          1.0                Income
30         1.0                Tablet
26         1.0               Marital
28         1.0                Techie
27         1.0                Gender
10         1.0              Contacts
8          1.0    Outage_sec_perweek
24         1.0                  Area
23         1.0                County
22         1.0                  City
9          1.0                 Email
37         1.0         PaymentMethod
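Each VIF value above is defined as 1 / (1 - R²), where R² comes from regressing that predictor on all the remaining predictors; the huge values for Tenure and Bandwidth_GB_Year therefore mean those two columns are nearly linear functions of the other features. A sketch of the definition computed directly on toy data (column names illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
a = rng.normal(size=200)
X_demo = pd.DataFrame({
    'a': a,
    'b': a * 0.9 + rng.normal(scale=0.3, size=200),  # nearly collinear with 'a'
    'c': rng.normal(size=200),                       # independent
})

def vif(df, col):
    """VIF_i = 1 / (1 - R^2) from regressing column i on the remaining columns."""
    others = df.drop(columns=[col])
    r2 = LinearRegression().fit(others, df[col]).score(others, df[col])
    return 1.0 / (1.0 - r2)

for col in X_demo.columns:
    print(col, round(vif(X_demo, col), 1))  # 'a' and 'b' high, 'c' near 1
```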
initial_significant_columns = [fitted_model.pvalues.index[key] for (key,value) in enumerate(fitted_model.pvalues) if value <signific]
initial_nonsign_columns = [fitted_model.pvalues.index[key] for (key,value) in enumerate(fitted_model.pvalues) if value >=signific]
initial_nonsign_columns.extend(test_nonsign_columns)
initial_nonsign_columns.extend(droplist)
initial_nonsign_columns = list(set(initial_nonsign_columns))
initial_significant_columns = [col for col in initial_significant_columns if col not in initial_nonsign_columns]
print('initial Significant columns: \n',initial_significant_columns)
print('\n initial Non Significant columns: \n',initial_nonsign_columns)
initial Significant columns:
['const', 'Children', 'Contacts', 'MonthlyCharge', 'Bandwidth_GB_Year', 'Techie', 'Contract', 'InternetService', 'OnlineBackup', 'StreamingMovies', 'PaymentMethod']

initial Non Significant columns:
['Tenure', 'Tablet', 'TechSupport', 'TimeZone', 'Email', 'Item3', 'Zip', 'City', 'StreamingTV', 'Outage_sec_perweek', 'Item4', 'Area', 'Item7', 'Lng', 'County', 'Marital', 'Item1', 'Yearly_equip_failure', 'Job', 'Item2', 'DeviceProtection', 'State', 'Age', 'CaseOrder', 'Multiple', 'Item8', 'Population', 'Income', 'Port_modem', 'Phone', 'OnlineSecurity', 'PaperlessBilling', 'Item6', 'Lat', 'Gender', 'Item5']
D208_t2_input=X.copy()
D208_t2_input['Churn']=y
D208_t2_input.to_csv('D208_t2_model_input.csv', index = False)
%%time
# initializing the model
X=lr_df.copy()
# adding categorical columns excluded earlier
catcols=[coln for coln in list(df_raw.columns) if coln not in list(X.columns)]
X[catcols]=df_raw.loc[:,catcols]
X = sm.add_constant(X) #adding constant that will represent the intercept of the LR model
X=X[initial_significant_columns] #dropping nonsignificant columns from the predictors data set
# Adding dummy variables
dummy_cols = [dummy for dummy in categ_variables if dummy in X.columns]
X=pd.get_dummies(X,drop_first=True,columns=dummy_cols)
X=cat2num(X) # To be able to correlate the categorical variables
X= X.select_dtypes(include=numerics) # removing the non-numeric columns
y = df_num[target] # assigning the target to vector y
# Splitting the data into train and test portions
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.20)
# LR implementation
logistic_regression = sm.Logit(y_train,X_train)
fitted_model = logistic_regression.fit()
predictions =fitted_model.predict(X_test)
print(fitted_model.summary())
Optimization terminated successfully.
Current function value: 0.225836
Iterations 9
Logit Regression Results
==============================================================================
Dep. Variable: Churn No. Observations: 8000
Model: Logit Df Residuals: 7985
Method: MLE Df Model: 14
Date: Thu, 15 Apr 2021 Pseudo R-squ.: 0.6104
Time: 13:14:33 Log-Likelihood: -1806.7
converged: True LL-Null: -4637.0
Covariance Type: nonrobust LLR p-value: 0.000
=========================================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------------------------------
const -6.4380 0.274 -23.537 0.000 -6.974 -5.902
Children 0.0520 0.020 2.654 0.008 0.014 0.090
Contacts 0.0532 0.042 1.260 0.208 -0.030 0.136
MonthlyCharge 0.0614 0.002 30.454 0.000 0.057 0.065
Bandwidth_GB_Year -0.0014 3.83e-05 -35.461 0.000 -0.001 -0.001
Techie_Yes 0.9917 0.112 8.820 0.000 0.771 1.212
Contract_One year -3.2331 0.138 -23.506 0.000 -3.503 -2.963
Contract_Two Year -3.3948 0.136 -24.888 0.000 -3.662 -3.127
InternetService_Fiber Optic -3.1804 0.124 -25.608 0.000 -3.424 -2.937
InternetService_None -1.2929 0.123 -10.502 0.000 -1.534 -1.052
OnlineBackup_Yes -0.4821 0.091 -5.316 0.000 -0.660 -0.304
StreamingMovies_Yes 0.3545 0.111 3.195 0.001 0.137 0.572
PaymentMethod_Credit Card (automatic) 0.1767 0.129 1.372 0.170 -0.076 0.429
PaymentMethod_Electronic Check 0.6246 0.116 5.402 0.000 0.398 0.851
PaymentMethod_Mailed Check 0.2309 0.126 1.837 0.066 -0.015 0.477
=========================================================================================================
Wall time: 337 ms
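In the cell above, `pd.get_dummies(..., drop_first=True)` expands each categorical column into 0/1 indicator columns and drops the first level as the reference category, which is why the summary shows `Contract_One year` and `Contract_Two Year` but no `Contract_Month-to-month` row. A toy illustration:

```python
import pandas as pd

demo = pd.DataFrame({'Contract': ['Month-to-month', 'One year',
                                  'Two Year', 'Month-to-month']})
dummies = pd.get_dummies(demo, drop_first=True, columns=['Contract'])
print(dummies)
# 'Month-to-month' is the dropped reference level: rows with it get all zeros,
# and its effect is absorbed into the model's constant term
```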
modelnumber=modelnumber+1
report=str("model "+str(modelnumber))+ classification_report(y_test,(predictions >=0.5).astype('int'))+'\n'+("-"*100)+'\n'+report
print(report)
# LR examination plots
sns.countplot(x = (fitted_model.predict(X_test)>=0.5).astype('int'),hue = y_test,alpha=0.7)
plt.xlabel('prediction')
plt.show()
residuals=((fitted_model.predict(X_test)>=0.5).astype('int'))- y_test
residuals.hist(bins = 5)
plt.title('Residuals')
plt.xlabel('residual(std.)')
plt.show()
plt.scatter(X_test.index,residuals,marker='o',s=20,alpha=0.7,c=y_test,cmap=sns.diverging_palette(250, 30, l=65, center="dark", as_cmap=True)) # alpha must be in [0, 1]
plt.title('Residuals positions (colored by "'+ target+'")')
plt.xlabel('index')
plt.ylabel('residuals')
plt.colorbar();plt.grid('on')
plt.show()
model 1 precision recall f1-score support
0 0.93 0.94 0.93 1481
1 0.82 0.78 0.80 519
accuracy 0.90 2000
macro avg 0.87 0.86 0.87 2000
weighted avg 0.90 0.90 0.90 2000
----------------------------------------------------------------------------------------------------
model 0 precision recall f1-score support
0 0.92 0.93 0.93 1498
1 0.79 0.75 0.77 502
accuracy 0.89 2000
macro avg 0.85 0.84 0.85 2000
weighted avg 0.89 0.89 0.89 2000
----------------------------------------------------------------------------------------------------
The selection is based mainly on statistical significance (P>|z|) with a significance level of 0.05. Variables with a p-value below 0.05 lead to rejection of the null hypothesis, meaning they are significant and relevant to the research question (prediction of 'Churn'). Variables with a p-value above 0.05 represent a failure to reject the null hypothesis, meaning that assigning them a zero coefficient (i.e., excluding them from the model) is a defensible decision.
Note: The output should include a screenshot of each model.
significant_columns = [fitted_model.pvalues.index[key] for (key,value) in enumerate(fitted_model.pvalues) if value <signific]
nonsign_columns = [fitted_model.pvalues.index[key] for (key,value) in enumerate(fitted_model.pvalues) if value >=signific]
nonsign_columns.extend(test_nonsign_columns)
nonsign_columns.extend(initial_nonsign_columns)
nonsign_columns.extend(droplist)
nonsign_columns = list(set(nonsign_columns)) # de-duplicate the accumulated list
significant_columns = [col for col in significant_columns if col not in nonsign_columns]
print('initial Significant columns: \n',significant_columns)
print('\n initial Non Significant columns: \n',nonsign_columns)
initial Significant columns:
['const', 'Children', 'MonthlyCharge', 'Bandwidth_GB_Year', 'Techie_Yes', 'Contract_One year', 'Contract_Two Year', 'InternetService_Fiber Optic', 'InternetService_None', 'OnlineBackup_Yes', 'StreamingMovies_Yes', 'PaymentMethod_Electronic Check']

initial Non Significant columns:
['Tenure', 'Tablet', 'TechSupport', 'TimeZone', 'Email', 'Item3', 'Zip', 'City', 'StreamingTV', 'Outage_sec_perweek', 'Item4', 'Area', 'Item7', 'Lng', 'County', 'Marital', 'Item1', 'Yearly_equip_failure', 'Job', 'Item2', 'DeviceProtection', 'State', 'Age', 'CaseOrder', 'Multiple', 'Item8', 'Population', 'Income', 'Port_modem', 'Phone', 'OnlineSecurity', 'PaperlessBilling', 'Item6', 'Lat', 'Gender', 'Item5']
%%time
# initializing the model
X=lr_df.copy()
# adding categorical columns excluded earlier
catcols=[coln for coln in list(df_raw.columns) if coln not in list(X.columns)]
X[catcols]=df_raw.loc[:,catcols]
X = sm.add_constant(X) #adding constant that will represent the intercept of the LR model
# Adding dummy variables
dummy_cols = [dummy for dummy in categ_variables if dummy in X.columns]
X=pd.get_dummies(X,drop_first=True,columns=dummy_cols)
X=X[significant_columns] #dropping nonsignificant columns from the predictors data set
X=cat2num(X) # To be able to correlate the categorical variables
X= X.select_dtypes(include=numerics) # removing the non-numeric columns
y = df_num[target] # assigning the target to vector y
# Splitting the data into train and test portions
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.20)
# LR implementation
logistic_regression = sm.Logit(y_train,X_train)
fitted_model = logistic_regression.fit()
predictions =fitted_model.predict(X_test)
print(fitted_model.summary())
Optimization terminated successfully.
Current function value: 0.228375
Iterations 9
Logit Regression Results
==============================================================================
Dep. Variable: Churn No. Observations: 8000
Model: Logit Df Residuals: 7988
Method: MLE Df Model: 11
Date: Thu, 15 Apr 2021 Pseudo R-squ.: 0.6048
Time: 13:15:20 Log-Likelihood: -1827.0
converged: True LL-Null: -4622.7
Covariance Type: nonrobust LLR p-value: 0.000
==================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------
const -6.2657 0.252 -24.897 0.000 -6.759 -5.772
Children 0.0621 0.020 3.149 0.002 0.023 0.101
MonthlyCharge 0.0603 0.002 30.562 0.000 0.056 0.064
Bandwidth_GB_Year -0.0013 3.76e-05 -35.577 0.000 -0.001 -0.001
Techie_Yes 1.0605 0.111 9.563 0.000 0.843 1.278
Contract_One year -3.2329 0.137 -23.600 0.000 -3.501 -2.964
Contract_Two Year -3.3642 0.135 -24.895 0.000 -3.629 -3.099
InternetService_Fiber Optic -2.9902 0.121 -24.666 0.000 -3.228 -2.753
InternetService_None -1.1145 0.120 -9.258 0.000 -1.350 -0.879
OnlineBackup_Yes -0.4477 0.090 -4.987 0.000 -0.624 -0.272
StreamingMovies_Yes 0.3377 0.110 3.060 0.002 0.121 0.554
PaymentMethod_Electronic Check 0.4957 0.089 5.541 0.000 0.320 0.671
==================================================================================================
Wall time: 6.77 s
modelnumber=modelnumber+1
report=str("model "+str(modelnumber))+ classification_report(y_test,(predictions >=0.5).astype('int'))+'\n'+("-"*100)+'\n'+report
print(report)
# LR examination plots
sns.countplot(x = (fitted_model.predict(X_test)>=0.5).astype('int'),hue = y_test,alpha=0.7)
plt.xlabel('prediction')
plt.show()
residuals=((fitted_model.predict(X_test)>=0.5).astype('int'))- y_test
residuals.hist(bins = 5)
plt.title('Residuals')
plt.xlabel('residual(std.)')
plt.show()
plt.scatter(X_test.index,residuals,marker='o',s=20,alpha=0.7,c=y_test,cmap=sns.diverging_palette(250, 30, l=65, center="dark", as_cmap=True)) # alpha must be in [0, 1]
plt.title('Residuals positions (colored by "'+ target+'")')
plt.xlabel('index')
plt.ylabel('residuals')
plt.colorbar();plt.grid('on')
plt.show()
model 2 precision recall f1-score support
0 0.93 0.94 0.93 1467
1 0.84 0.79 0.81 533
accuracy 0.90 2000
macro avg 0.88 0.87 0.87 2000
weighted avg 0.90 0.90 0.90 2000
----------------------------------------------------------------------------------------------------
model 1 precision recall f1-score support
0 0.93 0.94 0.93 1481
1 0.82 0.78 0.80 519
accuracy 0.90 2000
macro avg 0.87 0.86 0.87 2000
weighted avg 0.90 0.90 0.90 2000
----------------------------------------------------------------------------------------------------
model 0 precision recall f1-score support
0 0.92 0.93 0.93 1498
1 0.79 0.75 0.77 502
accuracy 0.89 2000
macro avg 0.85 0.84 0.85 2000
weighted avg 0.89 0.89 0.89 2000
----------------------------------------------------------------------------------------------------
%%time
#calling the plt_summary to the Final input/output DataFrame
df_tst = X.copy()
df_tst[target]=y>0.5
plt_summary(df_tst,"target_histplot",target)
Number of columns: 13
Wall time: 4.98 s
• the logic of the variable selection technique
The selection is based mainly on statistical significance (P>|z|) with a significance level of 0.05. Variables with a p-value below 0.05 lead to rejection of the null hypothesis, meaning they are significant and relevant to the research question (prediction of 'Churn'). Variables with a p-value above 0.05 represent a failure to reject the null hypothesis, meaning that assigning them a zero coefficient (i.e., excluding them from the model) is a defensible decision.
The final set of variables has been chosen based on sequential iterations, by running several models, eliminating non-significant features, then rerunning the model again and so on, until reaching the final model, not all the runs were included in this notebook (this notebook includes only a test model, initial model and a final reduced model).
• the model evaluation metric
The p-value is near zero for almost all variables of the reduced model (all are significant).
The classification report for the test portion of the data, which includes:
Precision: the percentage of observations predicted with a given label that actually belong to that label.
Note: The output should include the predictions from the refined model you used to perform the analysis.
D208_t2_output=X.copy()
D208_t2_output['Churn']=y
D208_t2_output['pred-Churn']=(fitted_model.predict(X)>=0.5).astype('int')
D208_t2_output['residuals']=(fitted_model.predict(X)>=0.5).astype('int')-y
D208_t2_output.to_csv('D208_t2_model_output.csv', index = False)
Included in this notebook
# Final reg. model Equation:
equation = ''
for (feature,coef) in dict(fitted_model.params).items():
equation = equation+' + (' + feature +' * '+str(np.round(coef,3))+ ') '
print('Log of the odds of '+target+' = \n (',equation[12:])
Log of the odds of Churn = ( -6.266) + (Children * 0.062) + (MonthlyCharge * 0.06) + (Bandwidth_GB_Year * -0.001) + (Techie_Yes * 1.06) + (Contract_One year * -3.233) + (Contract_Two Year * -3.364) + (InternetService_Fiber Optic * -2.99) + (InternetService_None * -1.114) + (OnlineBackup_Yes * -0.448) + (StreamingMovies_Yes * 0.338) + (PaymentMethod_Electronic Check * 0.496)
The regression equation yields the log of the odds of 'Churn': the predicted churn probability is above 0.5 when the log-odds is positive and below 0.5 when it is negative. Unstandardized inputs were used for easier interpretation of the log-odds equation, as follows:
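The log-odds converts to an actual probability through the logistic (sigmoid) function, and exponentiating a coefficient gives its odds ratio; for example, the `Techie_Yes` coefficient of about 1.06 implies roughly e^1.06 ≈ 2.9 times higher churn odds for "techie" customers, all other features held fixed. A sketch:

```python
import math

def churn_probability(log_odds):
    """Convert the log-odds from the regression equation into a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

print(churn_probability(0.0))    # 0.5, the decision boundary described above
print(round(math.exp(1.06), 1))  # odds ratio for Techie_Yes: 2.9
```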
(note: there are some minor differences in the coefficient values because the models were regenerated)
etc...
The model's accuracy could not be improved by reducing it.
Some observation values, such as 'Tenure', 'MonthlyCharge', and 'Bandwidth', were artificially created according to the (D207 D208 D209 Churn Data Consideration and Dictionary.pdf), meaning there is some bias in the estimation, specifically for the "brand new customers" (heteroscedasticity). This can be seen clearly in the residuals charts: predictions for observations in the first half of the data set (index < 5000) were not as good as predictions for the second half.
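The half-versus-half comparison behind that observation can be sketched as below, assuming the output frame built earlier (columns 'Churn' and 'pred-Churn'); the tiny demo frame and the `half_accuracies` helper are illustrative, with `split=2` standing in for the real index < 5000 cut.

```python
import pandas as pd

def half_accuracies(df, split=5000):
    """Prediction accuracy for observations before and after `split`."""
    correct = (df['Churn'] == df['pred-Churn'])
    return correct.iloc[:split].mean(), correct.iloc[split:].mean()

# illustrative data only
demo = pd.DataFrame({'Churn': [0, 1, 0, 1], 'pred-Churn': [0, 0, 0, 1]})
print(half_accuracies(demo, split=2))  # (0.5, 1.0)
```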
The number of observations used in training the model was very high (8,000). Attempts to decrease this number negatively affected the model, meaning the model cannot be considered highly robust, despite the high accuracy score achieved (> 90%).
Based on the final model's features and coefficients, some services and conditions contribute significantly, positively or negatively, to the probability of customer churn.
StreamingTV and StreamingMovies services: customers who use these services are more likely to churn. Such services should be reviewed and evaluated, as they negatively affect customer satisfaction.
Fiber Optic internet service: customers who use this service are less likely to churn; this service should be marketed and sold more.
One-year and two-year contracts: customers who hold these contracts are less likely to churn; the company should encourage customers to choose these two kinds of contracts.
Electronic Check payment method: customers who use this payment method are more likely to churn; the company should encourage other payment options.
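The findings above can be quantified by converting each coefficient to an odds ratio, exp(coef). The coefficient values below are copied from the final model equation printed earlier; the interpretation sketch itself is an addition, not part of the original notebook.

```python
import numpy as np

# coefficients copied from the final model equation above
coefs = {
    'Techie_Yes': 1.06,
    'Contract_Two Year': -3.364,
    'InternetService_Fiber Optic': -2.99,
    'PaymentMethod_Electronic Check': 0.496,
}
for name, b in coefs.items():
    # an odds ratio > 1 raises the churn odds, < 1 lowers them
    print(f'{name}: odds ratio = {np.exp(b):.3f}')
```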
I. Acknowledge sources, using in-text citations and references, for content that is quoted, paraphrased, or summarized.
J. Demonstrate professional communication in the content and presentation of your submission.
1- Bruce, P. C., Bruce, A., & Gedeck, P. (2020). Practical statistics for data scientists: 50 essential concepts. Sebastopol, CA: O'Reilly Media, Incorporated.
3- (WGU) Predictive Modeling – D208 course materials and Labs
4- General Questions/Answers from https://stackoverflow.com/
5- Documentation of Python packages: pandas, matplotlib, numpy, seaborn, and scipy